CREATED BY-
NAJEEB FARIDUDDIN SAIYED PRN 21070126057
KOTA SRINIVAS PRN 21070126050
Customer churn is one of the most critical issues in the telecommunications industry, because companies must understand and analyze customer trends to anticipate whether customers are going to unsubscribe from the firm or not. This is where machine learning algorithms help companies predict customer behaviour.
What is Customer Churn?
Customer churn occurs when subscribers or customers discontinue using a firm's services.
Customers have a wide variety of providers to choose from in the telecom industry, making it highly competitive: the annual churn rate of telecommunication businesses is between 15 and 25 percent.
Individualized customer retention is tough because most firms have a large number of customers and can't afford to devote much time to each of them. The costs would be too great, outweighing the additional revenue. However, if a corporation could forecast which customers are likely to leave ahead of time, it could focus customer retention efforts only on these "high risk" clients. The ultimate goal is to expand its coverage area and retain more customer loyalty. The key to succeeding in this market lies in the customers themselves.
A cool fun fact - "Did you know that attracting a new customer costs five times as much as keeping an existing one?"
To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.
To find any signs of churn, a company must develop a holistic view of its customers and their interactions with the services provided by the company, including store/branch visits, product purchase histories, customer service calls, web-based transactions, social media interactions, etc.
By analysing their customers well, these businesses can not only stand up to their competitors but also grow and thrive. Thus the company's key focus for success is retaining customers by implementing an effective retention strategy.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
#reading dataset1
df = pd.read_csv("C:\\Users\\Najeeb\\Desktop\\WA_Fn-UseC_-Telco-Customer-Churn (1).csv")
#displaying first 5 rows of the dataset
df.head()
#displaying all columns to get a better understanding of all the columns present in the dataset
df.columns
#shape of the data
df.shape
print("Number of rows are: ", df.shape[0])
print("Number of columns are: ", df.shape[1])
df.index
#displaying the information of this dataset
df.info()
Describing the data
#describing the data
df.describe(include='object')
#this displays summary statistics for all columns of 'object' datatype
#because without it, describe() would only report count, mean, etc. for numeric columns
#'top' gives the most frequent value of each categorical column
df.describe()
-Here the 25%/50%/75% quartiles of SeniorCitizen look improper, the reason being that it is a categorical (0/1) variable
-75% of customers have a tenure of less than 55 months
-Average monthly charges are $64.7
df['Contract'].unique()
df['PaymentMethod'].unique()
df['InternetService'].unique()
#Dropping customerID column since we have no need for its use
customerID = df['customerID'] #saving it into customerID in case we need to access it later
df = df.drop(['customerID'], axis = 1)
df.head()
#time to change the TotalCharges column to a numeric type
df.info()
#since the TotalCharges column was an object, we are going to convert it to numeric (float)
df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce') #It will replace all non-numeric values with NaN.
df['TotalCharges'] = df['TotalCharges'].astype("float")
df.isnull().sum()
After converting the column to a numeric type (so that we can apply mathematical operations to it), we see that there are actually missing values (blank spaces) hidden in it.
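The effect of `errors='coerce'` can be seen on a toy Series (a small self-contained sketch, not the notebook's data):

```python
import pandas as pd

# A toy Series mimicking TotalCharges: numbers stored as strings,
# with a blank space standing in for a missing value.
s = pd.Series(["29.85", "1889.5", " ", "108.15"])

converted = pd.to_numeric(s, errors="coerce")  # unparseable entries become NaN
print(converted.isnull().sum())  # -> 1
```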
#Rows with Null value in Total Charges - Column
df.loc[df['TotalCharges'].isnull() == True]
We are going to drop the rows with null values and readjust our index using the reset_index function.
df.dropna(how = 'any', inplace=True)
df = df.reset_index(drop=True)
#how: 'any' or 'all'
#'any' drops the row/column if ANY value is null; 'all' drops only if ALL values are null
#axis=1 or "columns" drops columns; axis=0 drops rows
#cross checking if there are any null values in our dataset.
df.isna().sum()
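The difference between `how='any'` and `how='all'` can be sketched on a tiny frame (illustrative values only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                    "b": [2.0, 5.0, np.nan]})

# how='any' drops a row if ANY value is NaN; how='all' only if ALL values are NaN
print(len(toy.dropna(how="any")))  # -> 1 row survives
print(len(toy.dropna(how="all")))  # -> 2 rows survive
```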
#since the SeniorCitizen column is in 0's and 1's we are going to
#convert it to Yes or No, just like other columns such as Partner, Dependents, PhoneService, etc.,
#to keep consistency with the data values in the other columns
df["SeniorCitizen"]= df["SeniorCitizen"].map({0: "No", 1: "Yes"})
df.head()
Now that all null values have been cleared, there are no more missing values in the dataset; thus, we proceed with exploratory data analysis.
Using Label Encoder
df2 = df.copy()
from sklearn.preprocessing import LabelEncoder
def object_to_int(dataframe_series):
    if dataframe_series.dtype == 'object':
        dataframe_series = LabelEncoder().fit_transform(dataframe_series)
    return dataframe_series
df = df.apply(lambda x: object_to_int(x))
df.head()
#plotting a pie chart using matplotlib
values1 = df2['gender'].value_counts() #using df2, which still has the original string labels
g_labels = values1.index               #taking labels from the data avoids a mismatched ordering
plt.figure(figsize=(8.5, 8.5))
colors = sns.color_palette('pastel')[0:5] #bright ,etc. are some color palletes
plt.pie(values1, labels = g_labels, explode = [0.07,0],colors=colors, shadow = True)
plt.legend(title="Gender")
plt.show()
#using plotly to plot an interactive pie chart
values1 = df2['gender'].value_counts() #df2 keeps the original string labels
g_labels = values1.index
colors = ['lightblue', 'pink']
fig = go.Figure(data=[go.Pie(labels=g_labels, values=values1)]) #a single Pie trace; px.pie + add_trace would create two overlapping traces
fig.update_traces(hole=0.5, hoverinfo ="label+percent", textfont_size=20, marker=dict(colors = colors))
fig.update_layout(title_text="<b>Gender Distribution</b>", annotations=[dict(text='Gender', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()
Customers are 49.5 % female and 50.5 % male.
values1 = df2['Churn'].value_counts() #df2 keeps the original Yes/No labels
c_labels = values1.index
colors = ['#2BE592', '#E96D73']
fig = go.Figure(data=[go.Pie(labels=c_labels, values=values1)]) #a single Pie trace; px.pie + add_trace would create two overlapping traces
fig.update_traces(hole=0.5, hoverinfo ="label+percent", textfont_size=20, marker=dict(colors=colors))
fig.update_layout(title_text="Churn Distribution",
annotations=[dict(text='Churn', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()
26.6% of customers switched to another firm, while around 73% of our customers preferred our services over the products and services offered by other firms.
plt.figure(figsize=(6, 6))
labels =["Churn: Yes","Churn:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ['#ff6666', '#66b3ff']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3)
explode_gender = (0.4,0.4,0.3,0.3)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90, explode=explode,radius=10, textprops =textprops )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),4.3, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Churn Distribution w.r.t. Gender', fontsize=15, y=1.1)
# show plot
plt.axis('square') #it will make a square plot ie., its going to plot with equal axes
plt.tight_layout()
plt.show()
Here we see that both genders behaved in a similar fashion when it comes to migrating to another service provider.
graph = sns.kdeplot(df['MonthlyCharges'][df['Churn'] == 1], color='#ff528f', fill=True) #fill=True replaces the deprecated shade=True
graph.legend(["Churn: Yes"], loc = 'upper right')
graph.set_ylabel('Density')
graph.set_xlabel('Monthly Charges')
graph.set_title('Churn Distribution w.r.t. Monthly Charges')
The higher the monthly charges, the higher the chance of the customer churning.
df_corr = df.corr()
fig = go.Figure()
fig.add_trace(go.Heatmap(x = df_corr.columns,y = df_corr.index,z = np.array(df_corr),text=df_corr.values,texttemplate='%{text:.2f}'))
fig.update_layout(autosize=False,width=1000,height=1000,margin=dict(l=50,r=50,b=100,t=100,pad=4))
fig.show()
This heatmap is very useful as it helps identify the highly correlated variables, which streamlines the feature selection process.
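As a sketch of how such a correlation matrix could feed feature selection, the pairs above a chosen threshold can be listed (the 0.8 cutoff and the `highly_correlated` helper are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

def highly_correlated(corr: pd.DataFrame, threshold: float = 0.8):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

# Tiny illustration: x and y are nearly identical, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x,
                    "y": x + rng.normal(scale=0.01, size=200),
                    "z": rng.normal(size=200)})
print(highly_correlated(toy.corr()))  # -> [('x', 'y')]
```

One of each highly correlated pair could then be dropped before model fitting.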
#univariate analysis
for i, predictor in enumerate(df2.drop(columns=['Churn', 'TotalCharges', 'MonthlyCharges'])):
    plt.figure(i, figsize=(20, 8))  #one figure per predictor
    sns.countplot(data=df2, x=predictor, hue='Churn')
CONCLUSION
-Customers paying by electronic check are the highest churners
-Customers with no online security, no tech support, or no device protection are high churners
-Non-senior citizens are high churners
-Customers with paperless billing are high churners
For example, customers with no dependents are more likely to churn: compare the orange (Churn = Yes) bars for the Dependents = Yes and Dependents = No groups and see which is higher.
df_yes = df2[df2['Churn']=='Yes']
fig = px.histogram(df2, x="Churn", color="Contract", barmode="group",title="<b>Contract distribution</b>")
fig.update_layout(width =800, bargap=0.1)
fig.show()
Customers with month-to-month contracts are more likely to churn than those on one-year and two-year contracts.
Monthly customers are more likely to churn because they have no contract terms and are free to leave.
About 75% of customers with a Month-to-Month contract opted to move out, as compared to 13% of customers with a One Year contract and 3% with a Two Year contract.
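Churn rates per contract type can be computed directly with a normalized crosstab; a minimal sketch on toy data (in the notebook, `df2` would be passed instead of the toy frame, and the numbers here are illustrative only):

```python
import pandas as pd

# Toy frame standing in for df2; pass df2 itself to get the real percentages.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No",
              "No", "No", "No", "Yes",
              "No", "No", "No", "No"],
})

# normalize='index' turns each row into proportions, i.e. the churn rate per contract type
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates["Yes"])
```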
df_yes = df2[df2['Churn']=='Yes']
color_map = {"Yes": "#0000FF", "No": "#FF00FF"}
fig = px.bar(df2, x="Contract", color="Churn", barmode="group", facet_col="gender", color_discrete_map=color_map, title="<b>Contract distribution w.r.t. Gender</b>",
             category_orders={"Contract": ["Month-to-month", "One year", "Two year"], "Churn": ["Yes", "No"], "gender": ["Male", "Female"]}) #the key must be spelled "Contract" for the ordering to apply
fig.show()
The distribution is almost the same for males and females.
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df_yes, x="Churn", color="Dependents", barmode="group", title="<b>Dependents distribution</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
From the graph we can interpret that customers with no dependents are more likely to churn.
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df2, x="Churn", color="SeniorCitizen", title="<b>Churn distribution w.r.t. Senior Citizen</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
We can interpret that customers who are not senior citizens are more likely to churn.
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df2, x="Churn", color="PhoneService", title="<b>Churn distribution w.r.t. Phone Service</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
Customers with phone service are more likely to churn
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df_yes, x="Churn", color="Partner", barmode="group", title="<b>Churn distribution w.r.t. Partners</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
Here we see that customers with no partners are more likely to churn
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df_yes, x="Churn", color="PaperlessBilling", barmode="group", title="<b>Churn distribution w.r.t. Paperless Billing</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
Customers who use paperless billing are more likely to churn.
fig = px.histogram(df_yes, x="Churn", color="PaymentMethod",barmode="group", title="<b>Customer Payment Method distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
Most customers who moved out used Electronic Check as their payment method. Customers who opted for Credit Card (automatic), Bank Transfer (automatic) or Mailed Check as their payment method were less likely to move out.
fig = px.box(df2, x='Churn', y = 'tenure')
fig.update_yaxes(title_text='Tenure (Months)')
fig.update_xaxes(title_text='Churn')
fig.update_layout(autosize=True, width=750, height=600,
title='<b>Churn distribution w.r.t. Tenure</b>')
fig.show()
New customers are more likely to churn
(i.e., we can infer that customers with tenure below roughly 10 months are the most likely to churn)
from sklearn.model_selection import train_test_split
X = df.drop(columns = ['Churn'])
y = df['Churn'].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30, random_state = 40, stratify=y)
#30% will go towards testing
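A quick self-contained illustration of what `stratify=y` buys us: the class ratio is preserved exactly in both splits (toy data, not the churn set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80% class 0, 20% class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=40, stratify=y)
# stratify=y preserves the 80/20 class ratio in both splits
print(y_tr.mean(), y_te.mean())  # -> 0.2 0.2
```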
from sklearn.preprocessing import StandardScaler
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
scaler= StandardScaler()
# The main idea is to standardize the features/columns of X individually (μ = 0, σ = 1) before applying any machine learning model.
# Leaving variances unequal effectively puts more weight on high-variance features, which distorts distance-based models such as KNN.
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
#fit_transform:
#computes the mean and std dev of each feature (to be reused for later scaling) and then scales using those statistics
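The fit/transform split can be seen on a toy array: the test point is scaled with the statistics learned from the training data only (illustrative values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [3.0], [5.0]])  # mean 3.0
test = np.array([[3.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

print(train_scaled.mean())   # -> 0.0
print(test_scaled[0, 0])     # -> 0.0 (the test point equals the train mean)
```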
df.shape
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error,f1_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors = 8)
knn_model.fit(X_train,y_train)
predicted_y = knn_model.predict(X_test)
accuracy_KNN = knn_model.score(X_test,y_test)
recallKNN = recall_score(y_test, predicted_y)
F1KNN = f1_score(y_test,predicted_y)
mseKNN = mean_squared_error(y_test,predicted_y)
maeKNN = mean_absolute_error (y_test,predicted_y)
print("KNN accuracy:",round(accuracy_KNN,2))
from sklearn.metrics import confusion_matrix
cm = confusion_matrix (y_test, predicted_y)
cm
#the cell values sum to the total number of test rows
#the main diagonal (top-left to bottom-right), 1345 + 291, gives the correctly predicted rows
#the other diagonal gives the misclassified rows
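The diagonal-equals-correct rule can be checked on a small example (toy labels, not the churn predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# Accuracy = sum of the main diagonal / total number of samples
acc = np.trace(cm) / cm.sum()
print(acc == accuracy_score(y_true, y_pred))  # -> True
```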
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(criterion="entropy", max_depth =6)
dt_model.fit(X_train,y_train)
predicted_y = dt_model.predict(X_test)
accuracy_DT = dt_model.score(X_test,y_test)
recallDT = recall_score(y_test, predicted_y)
F1DT = f1_score(y_test,predicted_y)
DTmse = mean_squared_error(y_test,predicted_y)
DTmae = mean_absolute_error (y_test,predicted_y)
print("Decision Tree accuracy is :",round(accuracy_DT,2))
from sklearn.metrics import confusion_matrix
cm = confusion_matrix (y_test, predicted_y)
cm
from sklearn.tree import export_graphviz
# generating decision tree image
export_graphviz(dt_model, out_file='iris_tree.dot', feature_names=df.columns[:19], class_names=['No', 'Yes'], rounded=True, filled=True) #class_names are the two Churn classes
!dot -Tpng iris_tree.dot -o iris_tree.png
from sklearn import svm
SVM_cls = svm.SVC(kernel = "linear")
SVM_cls.fit(X_train, y_train)
svmPrediction = SVM_cls.predict(X_test)
accuracysvm = accuracy_score(y_test, svmPrediction)
recallSVM = recall_score(y_test, svmPrediction)
mseSVM = mean_squared_error(y_test,svmPrediction)
maeSVM = mean_absolute_error (y_test,svmPrediction)
F1SVM = f1_score(y_test,svmPrediction,average='binary')
print("Support Vector Classifier accuracy is :",round(accuracysvm,2))
from sklearn.metrics import confusion_matrix
cm = confusion_matrix (y_test, svmPrediction)
cm
From the table below it is clear that the SVM model performs better than the Decision Tree and KNN models, with an overall accuracy of 79.7% and the highest F1-score of 0.59. Moreover, the mean squared error and mean absolute error of the SVM are lower than those of the Decision Tree and KNN.
data_report = np.array([['Algorithm', 'Accuracy Score', 'Recall', 'F1-Score', 'Mean Square Error', 'Mean Absolute Error'],
                        ['KNN', accuracy_KNN, recallKNN, F1KNN, mseKNN, maeKNN],
                        ['Decision Tree', accuracy_DT, recallDT, F1DT, DTmse, DTmae],
                        ['SVM', accuracysvm, recallSVM, F1SVM, mseSVM, maeSVM]]) #header order now matches the value order (recall before F1)
df_report = pd.DataFrame(data=data_report[1:, 1:], index=data_report[1:, 0], columns=data_report[0, 1:])
df_report
Mean squared error = (1/n) * Σ(actual – predicted)² ; the lower the better
Mean absolute error = (1/n) * Σ|actual – predicted| ; the average absolute difference between measured and true values
The accuracy score is used as a measure to calculate the performance of a model. The confusion matrix is used to evaluate the model.
The mean squared error (MSE) measures how far a set of points lies from the regression line: it takes the distances from the points to the line (these distances are the errors) and squares them. Squaring removes negative values and gives more weight to larger differences.
If the MSE score value is smaller it means you are very close to determining the best fit line which also depends on the data you are working on, so sometimes it may not be possible to get a small MSE score value.
Absolute error is the amount of error in your measurements. It is the difference between the measured value and the "true" value. For example, if a scale states 90 pounds but you know your true weight is 89 pounds, then the scale has an absolute error of 90 lbs – 89 lbs = 1 lb.
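Both formulas can be verified against scikit-learn on a tiny example (toy vectors, not the model outputs):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

actual = np.array([1, 0, 1, 1])
pred = np.array([1, 1, 1, 0])

mse = np.mean((actual - pred) ** 2)   # (1/n) * Σ (actual - predicted)²
mae = np.mean(np.abs(actual - pred))  # (1/n) * Σ |actual - predicted|

print(mse == mean_squared_error(actual, pred))   # -> True
print(mae == mean_absolute_error(actual, pred))  # -> True
```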
Accuracy is a metric for classification models that measures the number of predictions that are correct
A precise model is very “pure”: maybe it does not find all the positives, but the ones that the model does class as positive are very likely to be correct
A model with high recall succeeds well in finding all the positive cases in the data, even though they may also wrongly identify some negative cases as positive cases
The F1 score is defined as the harmonic mean of precision and recall.
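Precision, recall, and their harmonic mean can be computed by hand and checked against `f1_score` (toy labels for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(np.isclose(f1, f1_score(y_true, y_pred)))  # -> True
```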
def label_encode_input(item, data):
    if item == "gender":
        data = 1 if data == "Male" else 0
    elif item == "SeniorCitizen":
        data = 1 if data == "Yes" else 0
    elif item == "Partner":
        data = 1 if data == "Yes" else 0
    elif item == "Dependents":
        data = 1 if data == "Yes" else 0
    elif item == "PhoneService":
        data = 0 if data == "No" else 1
    elif item == "MultipleLines":
        data = 0 if data == "No" else (2 if data == "Yes" else 1)
    elif item == "InternetService":
        data = 2 if data == "No" else (1 if data == "Fiber optic" else 0)  #values are 'DSL'/'Fiber optic'/'No', not 'Yes'
    elif item == "OnlineSecurity":
        data = 0 if data == "No" else (2 if data == "Yes" else 1)
    elif item == "OnlineBackup":
        data = 0 if data == "No" else (2 if data == "Yes" else 1)
    elif item == "DeviceProtection":
        data = 0 if data == "No" else (2 if data == "Yes" else 1)
    elif item == "TechSupport":
        data = 0 if data == "No" else (2 if data == "Yes" else 1)
    elif item == "StreamingTV":
        data = 0 if data == "No" else (2 if data == "Yes" else 1)
    elif item == "StreamingMovies":
        data = 0 if data == "No" else (2 if data == "Yes" else 1)
    elif item == "Contract":
        data = 0 if data == "Month-to-month" else (1 if data == "One year" else 2)
    elif item == "PaperlessBilling":
        data = 1 if data == "Yes" else 0
    elif item == "PaymentMethod":
        data = 2 if data == "Electronic check" else (3 if data == "Mailed check" else (0 if data == "Bank transfer (automatic)" else 1))
    else:
        data = float(data)
    return float(data)
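The hand-written codes above are intended to mirror what `LabelEncoder` produced during training: it assigns codes in alphabetical order of the category strings. A quick check of that ordering for `PaymentMethod` (a sketch, assuming the training encoding was alphabetical, as `LabelEncoder` guarantees):

```python
from sklearn.preprocessing import LabelEncoder

methods = ["Electronic check", "Mailed check",
           "Bank transfer (automatic)", "Credit card (automatic)"]
enc = LabelEncoder().fit(methods)
# classes_ is sorted alphabetically; codes 0..3 follow this order,
# matching the hard-coded mapping in label_encode_input
print(list(enc.classes_))
```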
Accept Input from User to predict and preprocess input:
columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents','tenure', 'PhoneService', 'MultipleLines', 'InternetService','OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling','PaymentMethod', 'MonthlyCharges', 'TotalCharges']
X=[]
example_input = ["Female","No","Yes", "No", 1 ,"No","No phone service","DSL","No","Yes","No","No","No","No","Month-to-month","Yes","Electronic check",29.85,29.85]
print("Please Enter the data values:")
for item in columns:
    print(item, " : ")
    if item == "gender":
        print("( 'Female' , 'Male' )")
    elif item == "SeniorCitizen":
        print("( 'No' , 'Yes' )")
    elif item == "Partner":
        print("('Yes', 'No')")
    elif item == "Dependents":
        print("('No', 'Yes')")
    elif item == "PhoneService":
        print("('No', 'Yes')")
    elif item == "MultipleLines":
        print("('No phone service', 'No', 'Yes')")
    elif item == "InternetService":
        print("('DSL', 'Fiber optic', 'No')")
    elif item == "OnlineSecurity":
        print("('No', 'Yes', 'No internet service')")
    elif item == "OnlineBackup":
        print("('Yes', 'No', 'No internet service')")
    elif item == "DeviceProtection":
        print("('No', 'Yes', 'No internet service')")
    elif item == "TechSupport":
        print("('No', 'Yes', 'No internet service')")
    elif item == "StreamingTV":
        print("('No', 'Yes', 'No internet service')")
    elif item == "StreamingMovies":
        print("('No', 'Yes', 'No internet service')")
    elif item == "Contract":
        print("('Month-to-month', 'One year', 'Two year')")
    elif item == "PaperlessBilling":
        print("('Yes', 'No')")
    elif item == "PaymentMethod":
        print("('Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)')")
    data = input(" --> ")
    data = label_encode_input(item, data)
    X.append(data)
print("Input features -> ", X)
Prediction
X[4], X[-2], X[-1] = scaler.transform([[X[4], X[-2], X[-1]]])[0]  #scale tenure, MonthlyCharges, TotalCharges with the fitted scaler
svmPrediction = SVM_cls.predict([X])
if svmPrediction[0] == 0:
    print("Customer NOT churned. -> NO")
else:
    print("Customer HAS churned. -> YES")